Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve array interface of Raster for consistency with np.ma.masked_array operations #302

Merged
merged 19 commits into from
Sep 13, 2022

Conversation

rhugonnet
Copy link
Contributor

@rhugonnet rhugonnet commented Sep 6, 2022

Summary

This PR adds a nodata-aware array interface to work with Raster objects and their subclasses. All of the ~90 universal functions of NumPy are now supported (see ufuncs: https://numpy.org/doc/stable/reference/ufuncs.html), and also ~25 NumPy array functions defined on a per-need basis (such as np.min, np.max, np.mean, np.median, np.std, np.var, np.percentile, np.quantile). All these functions are wrapped in a way that accounts for possible nodata in Raster.data.

From __array_interface__ to __array_ufunc__ and __array_function__

In #210, we added __array_interface__ as a property to perform operations on Raster classes as if on array. However, this interface points to the array of the masked array .data.data, and therefore gives inconsistent results depending on nodata values.
For example:

np.min(raster.data) # minimum value is around 340, nodata value is -9999
Out[1]: 341.2

np.min(raster) 
Out[2]: -9999

This PR tries to expand the addition of @erikmannerfelt to apply to the masked array (Note: there is a mask key in the dict of __array_interface__, but I didn't understand how this is supposed to work, see https://numpy.org/doc/stable/reference/arrays.interface.html).
Instead, this PR uses __array__, __array_ufunc__ and __array_function__ methods of NumPy, that allow to define the comportement of the Raster class when functions are applied to it, respectively, np.array(), any universal functions ufunc, and any array function function. See https://numpy.org/doc/stable/user/basics.dispatch.html#basics-dispatch.

Choosing the behaviour of NumPy functions on Raster objects

Important note to start: I did not find a way to define the comportement of np.ma functions (which are not covered by __array_function__ of NumPy, and no equivalent seem to exists in the ma module). So this PR only concerns NumPy functions with the format np.function.

For defining behaviour, I thought the best option would be to enforce the exact same behaviour as when applying to the masked array .data. NumPy function generally account for masks even if not called from np.ma. However, there are a couple exceptions that ignore the mask (raising a UserWarning), those are:

  • np.median
  • np.quantile
  • np.percentile

And so, because of those three, I thought preferable to make the behaviour of NumPy functions on Raster always nodata-aware. This means that the behaviour is:

  1. All NumPy functions except the 3 listed above are cast to the masked array Raster.data. Then, if the function preserves shape (all ufuncs), the output masked array is encapsulated in a Raster object.
  2. For "median", the NumPy function is redefined into np.ma.median to be nodata-aware and cast on the masked array .data,
  3. For "quantile" and "percentile", no np.ma function exists, so the NumPy function is left the same and cast on valid values .data.data[~.data.mask]

Seeing how NumPy has made most classic functions outside of np.ma account for mask, it could be only a matter of years until they add np.median, np.percentile and np.quantile. Then the behaviour of our array interface would become fully consistent without the need for specific cases.

For documentation

In the documentation (once we set it up a bit better), we would recommend three options for array calculations, that always remain fully consistent:

  1. Applying any np.function to the Raster object,
  2. Applying any np.ma.function to the masked array .data of the Raster object,
  3. Applying any np.nanfunction to Raster.get_nanarray() once implemented (see Add a get_nan_array method to the Raster class. #277).

Tests

A specific test class TestArrayInterface is added.
It tests all combinations of:

  • np.function (all ufuncs and supported array funcs),
  • dtype supported by Raster (possibly two loops for two inputs),
  • nodata, either None or the default value of a dtype (possibly two nodata for two inputs),

which are applied to Raster object with a 5x5 masked array with random values of the dtype, and with a mask containing at least one True and one False value.

For all combinations, the test check that the output of the NumPy function applied to Raster.data is fully consistent with the same function applied to Raster, at the exception of the special case of "median", "percentile" and "quantile" checked separately.

Additionally, the PR checks that NumPy TypeError or ValueError are always raised through application to Raster when supposed to, i.e. for a wrong combination of input dtypes, and that this error behaviour is fully consistent with the one for the function being applied to Raster.data.

The many combinations increases our number of test reported by pytest by about 13,500. This is a lot but helps tracing quickly exactly which combination is failing (which occured quite a bit during the drafting phase of the PR). If we feel it's too much, we could move some of these test loops inside the test_ functions instead of as @pytest.mark.parametrize loops.

Little addition

I have uncommented tests on reflective arithmetic operations with 3D np.ndarrays that use to fail due to #256. I have also added 2D tests and updated the _overloading_check function to enable #304.

Linked issues

Resolves #261
Resolves #256
Resolves #304

@rhugonnet rhugonnet marked this pull request as draft September 6, 2022 19:17
if ufunc.nout == 1:
return self.__class__(
{
"data": getattr(ufunc, method)(inputs[0].data, **kwargs), # type: ignore
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't you define the type of the output (I assume np.ndarray) rather than ignoring mypy complain?

Copy link
Contributor Author

@rhugonnet rhugonnet Sep 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The type of the output is already defined for __array_ufunc__ and __array_function__ in the function definition.

Mypy still complains about something else specific to those lines, I think it's the indexing of inputs or something similar. I couldn't figure it out so I silenced it.

tests/test_georaster.py Outdated Show resolved Hide resolved
Copy link
Member

@adehecq adehecq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apart from a few minor comments, it looks all good. Thanks for fixing this !!
Just of curiosity, did you check how this is done in pandas?

@rhugonnet
Copy link
Contributor Author

I checked how this is recommended by NumPy (https://numpy.org/doc/stable/user/basics.dispatch.html#basics-dispatch), which is how it's typically done for dask arrays or cupy arrays (pandas does not encapsulate np.ndarray, so it's not exactly the same use cases).

@adehecq
Copy link
Member

adehecq commented Sep 13, 2022

I checked how this is recommended by NumPy (https://numpy.org/doc/stable/user/basics.dispatch.html#basics-dispatch), which is how it's typically done for dask arrays or cupy arrays (pandas does not encapsulate np.ndarray, so it's not exactly the same use cases).

Ok cool, it's nicely explained on the NumPy doc indeed !

@rhugonnet rhugonnet merged commit 96c432c into GlacioHack:main Sep 13, 2022
@rhugonnet rhugonnet deleted the improve_array_interface branch September 13, 2022 07:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants